1 Broad Goals

The broad goal of this project is to explore the IMDB movie dataset and address interesting questions such as which genres are produced most often, how IMDB scores are distributed, and which movies are profitable. Moreover, if possible, I want to identify which features contribute most to a highly rated or profitable film and try to predict a movie’s profitability.

Load the required R packages.

2 Data Preparation

The dataset used in this project is the IMDB 5000 Movie Dataset from Kaggle, which can be accessed via this link: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset. This dataset records information on 5043 movies across 66 countries from 1916 to 2016. It is available as a CSV file of about 1 MB.

Each record in this dataset has 28 variables, including information such as “Title of the movie”, “Name of the movie director”, “Country the movie was produced in”, “Budget of the movie ($)”, “profitability”, and “IMDB score for the movie (out of 10)”.

2.1 Data Import

Import the data and show its dimensions and the names of all attributes. There are 5043 movies recorded in this dataset, each with 28 attributes. Note that because Kaggle requires a username and password to download the dataset, I am sourcing the same data from my GitHub repository.

## [1] "tbl_df"     "tbl"        "data.frame"
## [1] 5043   28
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"
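
The import step can be sketched roughly as follows. This is a minimal sketch using base R's `read.csv`; the inline `csv_text` is a hypothetical two-row excerpt standing in for the real file, since the repository URL is not given here. The output above shows a tibble, so the original import was probably done with a tidyverse reader instead.

```r
# Hypothetical excerpt standing in for the real CSV file (assumption).
csv_text <- "movie_title,country,budget,gross,imdb_score
Avatar,USA,237000000,760505847,7.9
The Host,South Korea,12215500000,2201412,7.0"

# Read the data; stringsAsFactors = FALSE keeps text columns as character.
movies <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(movies)   # class of the imported object
dim(movies)     # number of rows and columns
names(movies)   # attribute names
```

For the real data, `text = csv_text` would be replaced by the path or URL of the CSV file.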

2.2 Data Cleaning

First, remove spurious characters from the movie title, genre, and plot keyword columns.
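
A minimal sketch of this kind of cleanup is shown here. The exact spurious characters are not listed in the text, so this example assumes trailing whitespace in titles and a pipe separator in the genre column; the sample values are hypothetical.

```r
# Hypothetical raw values (assumption: trailing blanks and "|" separators).
raw <- data.frame(
  movie_title = c("Avatar ", "Fargo             "),
  genres      = c("Action|Adventure|Sci-Fi", "Crime|Drama"),
  stringsAsFactors = FALSE
)

# Strip leading and trailing whitespace from the titles.
raw$movie_title <- trimws(raw$movie_title)

# Replace the pipe separator with a space and keep the result in a new column.
raw$genres_new <- gsub("|", " ", raw$genres, fixed = TRUE)
```

Writing the cleaned genres to a new column leaves the original values available for comparison.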

Second, remove duplicate entries in the “movie_title” column, as duplicate data may skew the analysis. In total, 126 duplicate movies were removed.

## [1] 126
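
As a rough illustration, duplicates by title can be counted and dropped with base R's `duplicated()`. The toy data below is hypothetical, and the original code may have used a different function.

```r
# Hypothetical toy data with one repeated title.
movies <- data.frame(
  movie_title = c("King Kong", "Avatar", "King Kong"),
  title_year  = c(2005, 2009, 2005),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence of a title.
is_dup <- duplicated(movies$movie_title)
sum(is_dup)                 # number of duplicate rows

# Keep only the first occurrence of each title.
movies <- movies[!is_dup, ]
```

Note that matching on the title alone would also drop distinct movies that share a name, such as remakes.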

Third, some columns contain currency values, such as “budget” and “gross”. For a few countries, including South Korea, Japan, and Thailand, among others, these values were not converted to US dollars. This could cause problems in later analysis, and the problem becomes even more complicated once inflation is taken into account. Thus, only movies from the USA were kept for the profitability analysis.

## # A tibble: 4,998 x 4
##    movie_title                  budget country        gross
##    <chr>                         <dbl> <chr>          <int>
##  1 The Host                12215500000 South Korea  2201412
##  2 Lady Vengeance           4200000000 South Korea   211667
##  3 Fateless                 2500000000 Hungary       195888
##  4 Princess Mononoke        2400000000 Japan        2298191
##  5 Steamboy                 2127519898 Japan         410388
##  6 Akira                    1100000000 Japan         439162
##  7 Godzilla 2000            1000000000 Japan       10037390
##  8 Kabhi Alvida Naa Kehna    700000000 India        3275443
##  9 Tango                     700000000 Spain        1687311
## 10 Kites                     600000000 India        1602466
## # ... with 4,988 more rows

Then, I created a column “profit_flag” to indicate whether a movie is profitable, i.e., gross > budget, where 1 means profitable. As this involves both the “gross” and “budget” columns, only movies from the USA have a non-empty value in this column.
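
The flag can be sketched as follows. The toy data is hypothetical, and coding non-profitable movies as 0 is an assumption, since the text only states that 1 means profitable.

```r
# Hypothetical toy data; amounts are in arbitrary units.
movies <- data.frame(
  country = c("USA", "USA", "Japan", "USA"),
  budget  = c(100, 50, 2400, NA),
  gross   = c(250, 30, 2.3, 80),
  stringsAsFactors = FALSE
)

# 1 when gross exceeds budget, 0 otherwise (assumed coding), NA for non-USA rows.
# A missing gross or budget also yields NA.
movies$profit_flag <- ifelse(
  movies$country == "USA",
  as.integer(movies$gross > movies$budget),
  NA_integer_
)
```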

Last, regarding records that contain missing values: to keep the dataset as complete as possible, I decided not to remove any rows with missing data and instead to handle the issue within each individual analysis. For example, in the genre-wise analysis, records that have no value for the genre variables are excluded.

The columns with the most “NA” entries are “gross”, “budget”, and “aspect_ratio”.

## # A tibble: 30 x 2
##    `Column Name`           NA_Count
##    <chr>                      <int>
##  1 gross                        874
##  2 budget                       487
##  3 aspect_ratio                 327
##  4 title_year                   107
##  5 director_facebook_likes      103
##  6 num_critic_for_reviews        49
##  7 actor_3_facebook_likes        23
##  8 num_user_for_reviews          21
##  9 duration                      15
## 10 facenumber_in_poster          13
## # ... with 20 more rows
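
A table like the one above can be produced along these lines. This is a sketch on hypothetical toy data using base R; the column names `column_name` and `na_count` are illustrative.

```r
# Hypothetical toy data with some missing values.
movies <- data.frame(
  gross    = c(NA, 100, NA),
  budget   = c(50, NA, 70),
  duration = c(90, 120, 100)
)

# Count NA entries per column.
na_count <- colSums(is.na(movies))

# Arrange as a two-column table, sorted from most to fewest missing values.
na_table <- data.frame(
  column_name = names(na_count),
  na_count    = as.integer(na_count),
  stringsAsFactors = FALSE
)
na_table <- na_table[order(-na_table$na_count), ]
```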

After the cleaning process, 3768 rows have no missing values.

## [1] 3768
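
The count of fully populated rows can be obtained with `complete.cases()`, as in this small sketch on hypothetical data:

```r
# Hypothetical toy data; only the third row has no missing value.
movies <- data.frame(
  gross  = c(NA, 100, 250),
  budget = c(50, NA, 70)
)

# complete.cases() is TRUE only for rows with no NA in any column.
n_complete <- sum(complete.cases(movies))
```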

Here is the preview of the data.

2.3 Data Description

The following table lists the name, type, and description of each variable in the dataset.

Name                       Type       Description
color                      character  Colorization: Color or Black and White
director_name              character  Name of the Director
num_critic_for_reviews     integer    Number of Critic Reviews
duration                   integer    Duration of the Movie in Minutes
director_facebook_likes    integer    Number of FB Page Likes of Director
actor_3_facebook_likes     integer    Number of FB Page Likes of Actor No.3
actor_2_name               character  Name of Actor No.2
actor_1_facebook_likes     integer    Number of FB Page Likes of Actor No.1
gross                      integer    Gross Earned in US Dollars
genres                     character  Categorization: Action, Comedy, Drama, etc.
actor_1_name               character  Name of Actor No.1
movie_title                character  Title of the Movie
num_voted_users            integer    Number of Users who Voted on IMDB
cast_total_facebook_likes  integer    Total FB Page Likes of the Entire Cast
actor_3_name               character  Name of Actor No.3
facenumber_in_poster       integer    Number of Actors Featured in the Movie Poster
plot_keywords              character  Keywords Describing the Plot
movie_imdb_link            character  IMDB Link of the Movie
num_user_for_reviews       integer    Number of Users who Reviewed the Movie
language                   character  Language of the Movie: English, French, Chinese, etc.
country                    character  Country where the Movie was Produced
content_rating             character  Content Rating
budget                     double     Budget in US Dollars
title_year                 integer    Year of Release
actor_2_facebook_likes     integer    Number of FB Page Likes of Actor No.2
imdb_score                 double     IMDB Score on a Scale of 1 to 10
aspect_ratio               double     Aspect Ratio
movie_facebook_likes       integer    Number of FB Page Likes of the Film
genres_new                 character  Edited genres
plot_keywords_new          character  Edited plot_keywords

3 Exploratory Data Analysis

3.1 Basic Analysis

3.1.1 Number of Movies Released per Year

3.2 Genre-wise Analysis

Construct the document-term matrix for genres.

## <<DocumentTermMatrix (documents: 4998, terms: 26)>>
## Non-/sparse entries: 14382/115566
## Sparsity           : 89%
## Maximal term length: 11
## Weighting          : term frequency (tf)

Use the document-term matrix to calculate the frequency of each genre.

## # A tibble: 26 x 2
##    genre     count
##    <chr>     <dbl>
##  1 drama      2571
##  2 comedy     1862
##  3 thriller   1396
##  4 action     1143
##  5 romance    1098
##  6 adventure   914
##  7 crime       883
##  8 sci-fi      611
##  9 fantasy     604
## 10 horror      556
## # ... with 16 more rows
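
The idea behind the two steps above can be illustrated without a text-mining package. The following base R sketch is a stand-in, not the original code: it builds a small movie-by-genre matrix by hand and sums its columns. The sample genre strings are hypothetical.

```r
# Hypothetical genre strings, one per movie.
genres <- c("Action|Adventure|Sci-Fi", "Drama", "Comedy|Drama")

# Split each record into lower-cased genre terms.
term_list <- strsplit(tolower(genres), "|", fixed = TRUE)
vocab <- sort(unique(unlist(term_list)))

# One row per movie, one column per genre; 1 if the movie has that genre.
dtm <- t(sapply(term_list, function(x) as.integer(vocab %in% x)))
colnames(dtm) <- vocab

# Column sums give the number of movies per genre.
genre_count <- sort(colSums(dtm), decreasing = TRUE)
```

Because a genre appears at most once per movie, this 0/1 matrix coincides with a term-frequency matrix.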

Plot the distribution of genre frequencies. We can see that the top 3 movie genres are Drama, Comedy, and Thriller.

Now, we want to identify which genres tend to have a higher budget, gross, or profit. As mentioned previously, this part only deals with movies from the USA because of the currency conversion issue.

Calculate budget, gross, and profit (= gross - budget) for each genre.

## # A tibble: 23 x 4
##    genres_new  mean_gross mean_budget mean_profit
##    <chr>            <dbl>       <dbl>       <dbl>
##  1 Action       86199950.   71082359.   15117591.
##  2 Adventure   108673001.   82183258.   26489743.
##  3 Animation   121531812.   86206545.   35325266.
##  4 Biography    46089398.   29135662.   16953735.
##  5 Comedy       54927568.   35433745.   19493823.
##  6 Crime        45009068.   34262435.   10746633.
##  7 Documentary  14683626.    4021501.   10662126.
##  8 Drama        42694443.   29676037.   13018406.
##  9 Family       95835029.   66128159.   29706870.
## 10 Fantasy      91606040.   66912956.   24693085.
## # ... with 13 more rows
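
One way to compute such per-genre means is to expand the data to one row per movie-genre pair and then aggregate. The sketch below uses base R and hypothetical toy data; the original code may differ.

```r
# Hypothetical toy data; amounts are in arbitrary units.
movies <- data.frame(
  genres = c("Action|Adventure", "Drama", "Action|Drama"),
  gross  = c(300, 40, 100),
  budget = c(200, 30, 120),
  stringsAsFactors = FALSE
)

# Expand to one row per (movie, genre) pair.
genre_list <- strsplit(movies$genres, "|", fixed = TRUE)
long <- data.frame(
  genres_new = unlist(genre_list),
  gross      = rep(movies$gross,  lengths(genre_list)),
  budget     = rep(movies$budget, lengths(genre_list)),
  stringsAsFactors = FALSE
)
long$profit <- long$gross - long$budget

# Mean gross, budget, and profit per genre; rows with NA are dropped by default.
by_genre <- aggregate(cbind(gross, budget, profit) ~ genres_new,
                      data = long, FUN = mean)
```

A movie with several genres contributes to the mean of each of them, so the genre groups overlap.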

Compare them using a grouped bar chart.

3.3 Country-wise Analysis

This part of the analysis explores the nationality component of the data. First, we show the number of movies produced in each country through the years, as presented in the following heat map. We can see that most countries started producing movies in the early 2000s, except for a handful that have had steady movie production since the mid-1900s. Here are some insights that I found:
1. The US has the most thriving movie industry and has been producing movies since the early-to-mid nineties.
2. Japan, Italy, Germany, and France are the only other countries that produced a significant number of movies before the 1980s.

English is the language in which most movies are made, and the USA produces movies in 14 languages, the most of any country.
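
The counts behind a heat map of this kind, and the number of languages per country, can be sketched as follows. The toy data is hypothetical and the code uses base R only.

```r
# Hypothetical toy data.
movies <- data.frame(
  country    = c("USA", "USA", "USA", "Japan", "France"),
  title_year = c(1995, 1995, 2010, 1988, 2010),
  language   = c("English", "Spanish", "English", "Japanese", "French"),
  stringsAsFactors = FALSE
)

# Cross-tabulate country by release year; a heat map would display this matrix.
year_counts <- table(movies$country, movies$title_year)

# Number of distinct languages per country, largest first.
n_lang <- sapply(split(movies$language, movies$country),
                 function(x) length(unique(x[!is.na(x)])))
n_lang <- sort(n_lang, decreasing = TRUE)
```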

3.4 IMDB Score Analysis

Here, I try to see which kinds of movies are more successful in terms of IMDB ratings.

We start by looking at the central tendency (mean) and the variation of the movie scores. For this purpose, I plotted a histogram that also marks the 5th and 95th percentiles of the IMDB score.

Summary statistics of the IMDB score: the average score is 6.4, and 90% of movies have a score between 4.3 and 8.1. The scores follow a roughly bell-shaped distribution, so any movie with a score above 8.1 is among the top 5% of movies in this dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.800   6.600   6.441   7.200   9.500

IMDB score distribution.

Top 15 movies with the highest IMDB score. The table shows 20 rows because several movies tie at a score of 8.8.

## # A tibble: 20 x 2
##    movie_title                                          avg_imdb
##    <chr>                                                   <dbl>
##  1 "Towering Inferno             "                           9.5
##  2 "The Shawshank Redemption "                               9.3
##  3 "The Godfather "                                          9.2
##  4 "Dekalog             "                                    9.1
##  5 "Kickboxer: Vengeance "                                   9.1
##  6 "Fargo             "                                      9  
##  7 "The Dark Knight "                                        9  
##  8 "The Godfather: Part II "                                 9  
##  9 "12 Angry Men "                                           8.9
## 10 "Pulp Fiction "                                           8.9
## 11 "Schindler's List "                                       8.9
## 12 "The Good, the Bad and the Ugly "                         8.9
## 13 "The Lord of the Rings: The Return of the King "          8.9
## 14 "Daredevil             "                                  8.8
## 15 "Fight Club "                                             8.8
## 16 "Forrest Gump "                                           8.8
## 17 "Inception "                                              8.8
## 18 "It's Always Sunny in Philadelphia             "          8.8
## 19 "Star Wars: Episode V - The Empire Strikes Back "         8.8
## 20 "The Lord of the Rings: The Fellowship of the Ring "      8.8

Top 15 directors with highest average IMDB score.

## # A tibble: 22 x 2
##    director_name    avg_imdb
##    <chr>               <dbl>
##  1 John Blanchard        9.5
##  2 Cary Bell             8.7
##  3 Mitchell Altieri      8.7
##  4 Sadyk Sher-Niyaz      8.7
##  5 Charles Chaplin       8.6
##  6 Mike Mayhall          8.6
##  7 Damien Chazelle       8.5
##  8 Majid Majidi          8.5
##  9 Raja Menon            8.5
## 10 Ron Fricke            8.5
## # ... with 12 more rows
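
Both “top 15” tables above contain more than 15 rows. One plausible explanation, which is an assumption on my part, is that rows tied with the 15th value are kept. A base R sketch of such a tie-preserving selection on hypothetical data:

```r
# Hypothetical toy data.
movies <- data.frame(
  director_name = c("A", "A", "B", "C", "D"),
  imdb_score    = c(8.0, 9.0, 8.5, 8.5, 7.0),
  stringsAsFactors = FALSE
)

# Average score per director, sorted from highest to lowest.
avg <- aggregate(imdb_score ~ director_name, data = movies, FUN = mean)
names(avg)[2] <- "avg_imdb"
avg <- avg[order(-avg$avg_imdb), ]

# Keep the top n, plus any directors tied with the n-th value.
n <- 2
cutoff <- avg$avg_imdb[min(n, nrow(avg))]
top <- avg[avg$avg_imdb >= cutoff, ]
```

Here `n = 2` returns three directors, because all three share the cutoff average of 8.5.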

3.5 Profitability

To understand the relationship between IMDB score, profit, and budget, I first plotted a 3D scatter plot using the popular visualization package “plotly” to get a big picture. The plot is interactive, so we can inspect the relationship from different angles. As previously discussed, this analysis only includes movies from the USA.

From the plot, we can see that movies with a higher IMDB score tend to have a higher profit, and that a significant number of movies end up losing money. Intuitively, IMDB score and gross may be correlated, since people prefer to watch famous and highly rated movies.

Commercial Success vs. Critical Acclaim for movies from the USA.

These are the top 15 movies with the highest profit, shown with their profit and Return on Investment (ROI). Note that the bigger a point is, the higher its ROI. For movies with a budget over 70 million dollars, we can observe a roughly linear upward trend, which suggests that bigger-budget movies tend to earn more profit. However, there is a downward trend when the budget is below 70 million dollars. Looking more closely at the movies in this region, I found that most of them were produced in the 80s or early 90s, so their true budgets would be higher once inflation is taken into account.

Nonetheless, profit alone does not give the whole picture of a movie’s monetary success across the years, so Return on Investment is perhaps a more suitable measure of a movie’s profitability. Thus, here are the top 15 movies with the highest Return on Investment among movies with a budget of at least 10 million dollars.
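
The ROI formula is not spelled out in the text. The sketch below assumes ROI = profit / budget × 100, which appears consistent with the director table later in this section (for example, an average profit of 305 on an average budget of 58 gives roughly 526). The movie data is hypothetical.

```r
# Hypothetical toy data; amounts in US dollars.
movies <- data.frame(
  movie_title = c("A", "B", "C"),
  budget      = c(58e6, 2e6, 150e6),
  gross       = c(363e6, 185e6, 400e6),
  stringsAsFactors = FALSE
)

# Profit and ROI in percent (assumed definition: profit / budget * 100).
movies$profit <- movies$gross - movies$budget
movies$roi    <- movies$profit / movies$budget * 100

# Keep movies with a budget of at least 10 million dollars, then rank by ROI.
big     <- movies[!is.na(movies$budget) & movies$budget >= 10e6, ]
top_roi <- head(big[order(-big$roi), ], 15)
```

The budget threshold removes very cheap films such as movie “B”, whose tiny budget would otherwise dominate the ROI ranking.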

Top 15 directors with the highest average gross earned, profit, and Return on Investment (R Core Team 2019). In the table, profit and budget appear to be in millions of US dollars and ROI in percent.

## # A tibble: 15 x 4
##    director_name    avg_profit avg_budget avg_ROI
##    <chr>                 <dbl>      <dbl>   <dbl>
##  1 Tim Miller             305.       58     526. 
##  2 George Lucas           277.       71.0  3902. 
##  3 Richard Marquand       277.       32.5   851. 
##  4 Irvin Kershner         272.       18    1512. 
##  5 Kyle Balda             262.       74     354. 
##  6 Colin Trevorrow        253.       75.4   385. 
##  7 Chris Buck             251.      150     167. 
##  8 Pierre Coffin          237.       72.5   324. 
##  9 Lee Unkrich            215.      200     107. 
## 10 Joss Whedon            199.      170      76.7
## 11 James Cameron          195.      124.    153. 
## 12 Roger Allers           189.       65     419. 
## 13 William Cottrell       183.        2    9146. 
## 14 Pete Docter            158.      155     108. 
## 15 Francis Lawrence       151.      121.    120.

References

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.